04. Problem Setup
We're now ready to rigorously define how policy gradient methods work.
## Important Note
Before moving on, make sure it's clear to you that the equation discussed in the video (and shown below) calculates an expectation.
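In symbols, writing the expected return as $U(\theta)$ (the symbol is our choice of notation), the equation is

$$U(\theta) = \sum_\tau \mathbb{P}(\tau;\theta)\, R(\tau),$$

where the sum is taken over all possible trajectories $\tau$.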
To see how it corresponds to the expected return, note that we've expressed the return $R(\tau)$ as a function of the trajectory $\tau$. Then, we calculate the weighted average (where the weights are given by $\mathbb{P}(\tau;\theta)$) of all possible values that the return $R(\tau)$ can take.
## Why Trajectories?
You may be wondering: why are we using trajectories instead of episodes? The answer is that maximizing expected return over trajectories (instead of episodes) lets us search for optimal policies for both episodic and continuing tasks!
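As a quick reminder, a trajectory is just a state-action sequence; one common convention (the exact indexing is an assumption here, not fixed by the lesson) writes a trajectory of horizon $H$ and its return as

$$\tau = (s_0, a_0, s_1, a_1, \ldots, s_H, a_H, s_{H+1}), \qquad R(\tau) = r_1 + r_2 + \cdots + r_{H+1}.$$

Because a trajectory is any finite sequence of this form, it doesn't have to end at a terminal state, which is why the same objective applies to continuing tasks.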
That said, for many episodic tasks, it often makes sense to just use the full episode. In particular, in the video game example described in the lessons, the reward is only delivered at the end of the episode. In that case, in order to estimate the expected return, the trajectory should correspond to the full episode; otherwise, we won't have enough reward information to meaningfully estimate the expected return.
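To make this concrete, here is a minimal sketch of how the expected return could be estimated by averaging the returns of sampled full episodes. The `env` and `policy` objects are placeholders (a classic Gym-style environment and any callable mapping states to actions), not something defined in the lesson.

```python
import numpy as np

def estimate_expected_return(env, policy, num_episodes=100, max_steps=1000):
    """Monte Carlo estimate of the expected return.

    Assumes a classic Gym-style API: env.reset() returns a state, and
    env.step(action) returns (state, reward, done, info).
    """
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        episode_return = 0.0
        for _ in range(max_steps):
            action = policy(state)                     # select an action with the current policy
            state, reward, done, _ = env.step(action)
            episode_return += reward                   # accumulate reward along the trajectory
            if done:                                   # full episode collected
                break
        returns.append(episode_return)
    # Averaging over sampled trajectories approximates the weighted average
    # over all trajectories, with weights P(tau; theta).
    return float(np.mean(returns))
```

If the reward only arrives at the end of the episode (as in the video game example), each `episode_return` is simply that final reward, which is why the trajectory must cover the full episode to carry any reward information.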